Boardgame Rating Prediction

Posted on Sun 23 September 2018 in Machine Learning

Predict the Rating for Board Games

The data set contains 80,000 board games, with information about each game and its associated review scores. I'm going to predict average_rating using the other columns.

In [2]:
import pandas as pd
board_games = pd.read_csv("board_games.csv")
board_games.head()
Out[2]:
  | id | type | name | yearpublished | minplayers | maxplayers | playingtime | minplaytime | maxplaytime | minage | users_rated | average_rating | bayes_average_rating | total_owners | total_traders | total_wanters | total_wishers | total_comments | total_weights | average_weight
0 | 12333 | boardgame | Twilight Struggle | 2005.0 | 2.0 | 2.0 | 180.0 | 180.0 | 180.0 | 13.0 | 20113 | 8.33774 | 8.22186 | 26647 | 372 | 1219 | 5865 | 5347 | 2562 | 3.4785
1 | 120677 | boardgame | Terra Mystica | 2012.0 | 2.0 | 5.0 | 150.0 | 60.0 | 150.0 | 12.0 | 14383 | 8.28798 | 8.14232 | 16519 | 132 | 1586 | 6277 | 2526 | 1423 | 3.8939
2 | 102794 | boardgame | Caverna: The Cave Farmers | 2013.0 | 1.0 | 7.0 | 210.0 | 30.0 | 210.0 | 12.0 | 9262 | 8.28994 | 8.06886 | 12230 | 99 | 1476 | 5600 | 1700 | 777 | 3.7761
3 | 25613 | boardgame | Through the Ages: A Story of Civilization | 2006.0 | 2.0 | 4.0 | 240.0 | 240.0 | 240.0 | 12.0 | 13294 | 8.20407 | 8.05804 | 14343 | 362 | 1084 | 5075 | 3378 | 1642 | 4.1590
4 | 3076 | boardgame | Puerto Rico | 2002.0 | 2.0 | 5.0 | 150.0 | 90.0 | 150.0 | 12.0 | 39883 | 8.14261 | 8.04524 | 44362 | 795 | 861 | 5414 | 9173 | 5213 | 3.2943

Cleaning

In [3]:
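# Drop every row that has a missing value in any column
board_games.dropna(axis=0, inplace=True)
# Keep only games that have received at least one user rating
board_games = board_games[board_games['users_rated'] > 0]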
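
To see what the two cleaning steps do, here is a minimal sketch on a made-up three-row frame. The frame toy_games and its values are hypothetical and are not taken from board_games.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
toy_games = pd.DataFrame({
    "name": ["A", "B", "C"],
    "users_rated": [10, 0, 5],
    "average_rating": [7.5, 0.0, np.nan],
})

# Drop rows with any missing value (removes game C).
cleaned = toy_games.dropna(axis=0)
# Keep only games with at least one rating (removes game B).
cleaned = cleaned[cleaned["users_rated"] > 0]

print(cleaned)
```

Only game A survives both filters, which is the behaviour expected from the cell above on the real data.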

Data Exploration

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(board_games['average_rating'])
plt.show()
plt.boxplot(board_games['average_rating'])
plt.show()

std = board_games['average_rating'].std()
mean = board_games['average_rating'].mean()

print(std)
print(mean)
1.57882993483
6.01611284933

Error Metric

The ratings are roughly normally distributed, so we can use mean squared error as the error metric.
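
As a minimal sketch of the metric, assuming a small made-up set of ratings (the arrays below are illustrative and not taken from the data set): mean squared error averages the squared differences between actual and predicted values, and its square root (RMSE) is expressed in rating units again.

```python
import numpy as np

# Hypothetical ratings and predictions, for illustration only.
actual = np.array([8.0, 6.0, 7.0, 5.0])
predicted = np.array([7.5, 6.5, 6.0, 5.0])

# Mean squared error: average of the squared residuals.
mse = np.mean((actual - predicted) ** 2)

# Root mean squared error: same units as the ratings.
rmse = np.sqrt(mse)

print(mse)
print(rmse)
```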

Clustering

In [5]:
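import numpy
from sklearn.cluster import KMeans

# Cluster the games into 5 groups using every column from 'yearpublished' onward
kmeans_model = KMeans(n_clusters=5, random_state=1)
numeric_columns = board_games.iloc[:, 3:]
kmeans_model.fit(numeric_columns)
labels = kmeans_model.labels_

# Summarize each game by the mean and standard deviation of its row
game_mean = numeric_columns.apply(numpy.mean, axis=1)
game_std = numeric_columns.apply(numpy.std, axis=1)

# Plot the row summaries, colored by cluster label
plt.scatter(x=game_mean, y=game_std, c=labels)
plt.show()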

It looks like most of the games are similar: 4 of the 5 clusters lie between mean = 0 and mean = 4000.
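
One way to check how unevenly the games are spread over the clusters is to count the rows per label. The following is a minimal sketch on synthetic data: the frame toy, its column names, and the two artificial groups are assumptions made for illustration. With the real data, the labels array from the cell above would take the place of toy_labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the numeric columns: one large group and one small group.
rng = np.random.RandomState(1)
big = rng.normal(loc=0.0, scale=1.0, size=(95, 3))
small = rng.normal(loc=50.0, scale=1.0, size=(5, 3))
toy = pd.DataFrame(np.vstack([big, small]), columns=["a", "b", "c"])

toy_model = KMeans(n_clusters=2, random_state=1, n_init=10)
toy_labels = toy_model.fit_predict(toy)

# Number of rows assigned to each cluster label.
cluster_sizes = pd.Series(toy_labels).value_counts()
print(cluster_sizes)
```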

Finding Correlations

Remove columns that don't add predictive power to the model.

In [6]:
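# Correlate only the numeric columns; 'type' and 'name' are strings
correlations = board_games.select_dtypes(include='number').corr()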
print(correlations['average_rating'])
id                      0.304201
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64
  • The 'yearpublished' column is, somewhat surprisingly, positively correlated with average_rating (0.11), so more recent games tend to be rated slightly higher.
  • The higher the 'minage', the higher the rating tends to be (0.21).
  • The more "weighty" a game is (its complexity rating), the more highly it tends to be rated (0.35).
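
To make the selection step concrete, here is a minimal sketch that ranks a few of the correlations printed above by absolute value and keeps those above a cutoff. The cutoff of 0.04 and the names corr_with_target, ranked, and selected are my own assumptions for illustration.

```python
import pandas as pd

# A subset of the correlations with average_rating, copied from the output above.
corr_with_target = pd.Series({
    "yearpublished": 0.108461,
    "minplayers": -0.032701,
    "maxplayers": -0.008335,
    "minage": 0.210049,
    "average_weight": 0.351081,
})

# Rank features by the strength of the correlation, ignoring its sign.
ranked = corr_with_target.abs().sort_values(ascending=False)

# Keep only features above an assumed, arbitrary threshold.
threshold = 0.04
selected = ranked[ranked > threshold].index.tolist()
print(selected)
```

With this cutoff, 'minplayers' and 'maxplayers' are dropped, while the three stronger features are kept.
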
In [25]:
cols = list(board_games.columns)
cols.remove("average_rating")
cols.remove("bayes_average_rating")
cols.remove("minplayers")
cols.remove("maxplayers")
# identifier and non-numeric columns
cols.remove("name")
cols.remove("id")
cols.remove("type")

I removed the columns that are not useful, such as 'bayes_average_rating', which is derived from 'average_rating'.

Linear Regression

In [28]:
from sklearn.linear_model import LinearRegression

# Training
lr = LinearRegression()
lr.fit(board_games[cols], board_games["average_rating"])

# Prediction
predictions = lr.predict(board_games[cols])

from sklearn.metrics import mean_squared_error
import math

mse = mean_squared_error(board_games['average_rating'], predictions)
rmse = math.sqrt(mse)

print(rmse)
1.4479383303003244

The RMSE (1.45) is close to the standard deviation (1.58) of all board game ratings, and it was measured on the same data the model was trained on. This indicates that the model does not have high predictive power.
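
The comparison with the standard deviation can be made explicit by scoring a baseline that always predicts the mean, and by evaluating on rows the model has not seen. The following is a minimal sketch on synthetic data, not the approach used in this post: the arrays X and y, the 80/20 split, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature, one noise feature.
rng = np.random.RandomState(1)
X = rng.normal(size=(500, 2))
y = 6.0 + 0.5 * X[:, 0] + rng.normal(scale=1.5, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit on the training rows, score on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
model_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Baseline: always predict the mean target seen in training.
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))

print(model_rmse, baseline_rmse)
```

If the model's RMSE is only slightly below the baseline's, the features explain little of the variation in the target, which is the situation described above.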